Abstract. The Lasso is a popular statistical tool invented by Robert 
Tibshirani for linear regression when the number of covariates is greater 
than or comparable to the number of observations. The validity of the 
Lasso procedure has been theoretically established under a variety of 
complicated-looking assumptions by various authors. This article shows 
that for the loss function considered in Tibshirani's original paper, the 
Lasso is consistent under almost no assumptions at all. 



1. Introduction 

The Lasso is a penalized regression procedure introduced by Tibshirani [26] 
in 1996. Given response variables $y_1, \ldots, y_n$ and $p$-dimensional 
covariates $x_1, \ldots, x_n$, the Lasso fits the linear regression model 

$$E(y_i \mid x_i) = \beta \cdot x_i$$

by minimizing the $\ell^1$ penalized squared error 

$$\sum_{i=1}^n (y_i - \beta \cdot x_i)^2 + \lambda \sum_{j=1}^p |\beta_j|,$$

where $\beta = (\beta_1, \ldots, \beta_p)$ is the vector of regression parameters 
and $\lambda$ is a penalization parameter. As $\lambda$ increases, the Lasso 
estimates are shrunk towards zero. An interesting and useful feature of the 
Lasso is that it is well-defined even if $p$ is greater than $n$. Moreover, 
often only a small fraction of the estimated $\beta_j$'s turn out to be 
non-zero, which produces an effect of automatic variable selection. Finally, 
there is a fast and simple procedure for computing the Lasso estimates 
simultaneously for all $\lambda$ using the Least Angle Regression (LARS) 
algorithm of Efron et al. [14]. The success of the Lasso stems from all of 
these factors. 

There have been numerous efforts to give conditions under which the 
Lasso 'works'. Much of this work has its origins in the investigations of 
$\ell^1$ penalization by David Donoho and coauthors (some of it predating 
Tibshirani's original paper) [9, 10, 11, 12, 13]. Major advances were made by 
Osborne et al. [25], Knight and Fu [19], Fan and Li [15], Meinshausen and 
Bühlmann [23], Yuan and Lin [29], Zhao and Yu [32], Zou [33], Greenshtein 
and Ritov [17], Bunea et al. [5, 6], Candès and Tao [7, 8], Zhang and Huang 
[31], Lounici [21], Bickel et al. [2], Zhang [30], Koltchinskii [20], Wainwright 
[28], Bartlett et al. [1], and many other authors. Indeed, it is a daunting 
task to compile a thorough review of the literature. Fortunately, this 
daunting task has been accomplished in the recent book of Bühlmann and van de 
Geer [4], to which we refer the reader for an extensive bibliography and a 
comprehensive treatment of the Lasso and its many variants. 

A common feature of most of the above work is the assumption that only 
a small number of the true $\beta_j$'s are nonzero, followed by a search for 
conditions under which this set is correctly identified with high probability 
by the Lasso procedure with an appropriate choice of the penalization 
parameter. This quest invariably leads to complicated non-degeneracy 
conditions on the covariance matrix of the covariates. The conditions are 
usually unverifiable or too artificial to hold for real data, and yet it is 
known that such conditions are sometimes actually necessary for certain kinds 
of consistency to hold [33, 32, 24]. The article [27] can serve as a quick 
reference for the list of all prominent assumptions and their inter-relations. 

The main point of this paper is to show that for the loss function 
considered by Tibshirani in [26] (the 'prediction loss'), the Lasso is consistent 
under almost no assumptions beyond the bare minimum required for setting 
up the ordinary least squares regression problem. Results that are similar 
in spirit (but not the same) have appeared recently in the important works 
of Bühlmann and van de Geer [4] and Bartlett et al. [1]. Comparisons will 
be given later. 



2. The setup 

Suppose that $X_1, \ldots, X_p$ are (possibly dependent) random variables, and 
$M$ is a constant such that 

(1) $|X_j| \le M$

almost surely for each $j$. Let 

(2) $Y = \sum_{j=1}^p \beta_j^* X_j + \varepsilon,$

where $\varepsilon$ is independent of the $X_j$'s and 

(3) $\varepsilon \sim N(0, \sigma^2).$

Here $\beta_1^*, \ldots, \beta_p^*$ and $\sigma^2$ are unknown constants. 

Let $Z$ denote the random vector $(Y, X_1, \ldots, X_p)$. Let $Z_1, \ldots, Z_n$ 
be i.i.d. copies of $Z$. We will write $Z_i = (Y_i, X_{i,1}, \ldots, X_{i,p})$. 
The set of vectors $Z_1, \ldots, Z_n$ is our data. The conditions (1), (2), (3) 
and the independence of $Z_1, \ldots, Z_n$ are all that we need to assume in 
this paper, besides the sparsity condition that $\sum_{j=1}^p |\beta_j^*|$ is 
not too large. 
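
To make the setup concrete, the following sketch draws data satisfying (1), (2) and (3). The uniform covariate distribution, the particular parameter vector, and the function name `simulate_setup` are illustrative assumptions only; the setup itself allows any bounded, possibly dependent covariates.

```python
import numpy as np

def simulate_setup(n, p, beta_star, sigma, M=1.0, seed=0):
    """Draw n i.i.d. copies of Z = (Y, X_1, ..., X_p) under (1)-(3).

    Assumption for illustration: covariates are i.i.d. Uniform[-M, M].
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-M, M, size=(n, p))      # |X_{i,j}| <= M, condition (1)
    eps = rng.normal(0.0, sigma, size=n)     # Gaussian noise, condition (3)
    Y = X @ beta_star + eps                  # linear model, condition (2)
    return X, Y

# Hypothetical example with p > n and a non-sparse beta* of small l1 norm.
n, p = 50, 200
beta_star = np.full(p, 1.0 / p)              # sum_j |beta*_j| = 1
X, Y = simulate_setup(n, p, beta_star, sigma=0.5)
```

Note that no coordinate of this `beta_star` is zero; only its $\ell^1$ norm is controlled, which is the kind of condition the paper works with.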

3. Prediction error 

Suppose that in the vector $Z$, the value of $Y$ is unknown and our task 
is to predict $Y$ using the values of $X_1, \ldots, X_p$. If the parameter vector 
$\beta^* = (\beta_1^*, \ldots, \beta_p^*)$ were known, then the best predictor of 
$Y$ based on $X_1, \ldots, X_p$ would be the linear combination 

$$\hat Y := \sum_{j=1}^p \beta_j^* X_j.$$

However, $\beta_1^*, \ldots, \beta_p^*$ are unknown, and so we need to estimate 
them from the data $Z_1, \ldots, Z_n$. The 'mean squared prediction error' of any 
estimator $\tilde\beta = (\tilde\beta_1, \ldots, \tilde\beta_p)$ is defined as the 
expected squared error in estimating $\hat Y$ using $\tilde\beta$, that is, 

(4) $\mathrm{MSPE}(\tilde\beta) := E(\hat Y - \tilde Y)^2,$

where 

$$\tilde Y := \sum_{j=1}^p \tilde\beta_j X_j.$$

Note that here $\tilde\beta_1, \ldots, \tilde\beta_p$ are computed using the data 
$Z_1, \ldots, Z_n$, and are therefore independent of $X_1, \ldots, X_p$. By this 
observation it is easy to see that the prediction error may be alternatively 
expressed as follows. Let $\Sigma$ be the covariance matrix of 
$(X_1, \ldots, X_p)$, and let $\|\cdot\|_\Sigma$ be the norm (or seminorm) on 
$\mathbb{R}^p$ induced by $\Sigma$, that is, 

$$\|x\|_\Sigma^2 = x \cdot \Sigma x.$$

With this definition, the mean squared prediction error of any estimator 
$\tilde\beta$ may be written as 

$$\mathrm{MSPE}(\tilde\beta) = E\|\beta^* - \tilde\beta\|_\Sigma^2.$$

While this alternative representation of the mean squared prediction error 
may make it more convenient to connect it to, say, the $\ell^2$ loss, the original 
definition (4) is more easily interpretable and acceptable from a practical 
point of view. 
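
The passage from (4) to the $\Sigma$-norm expression can be spelled out as below. This sketch reads $\Sigma$ as the matrix with entries $E(X_j X_k)$, which coincides with the covariance matrix when the covariates have mean zero. That reading is an assumption made here; it is chosen because $E(X_j X_k)$ is the quantity that appears in the proofs of Section 6.

```latex
\begin{align*}
\mathrm{MSPE}(\tilde\beta)
  &= E\Bigl(\sum_{j=1}^p (\beta_j^* - \tilde\beta_j) X_j\Bigr)^2 \\
  &= E\Bigl[\sum_{j,k=1}^p (\beta_j^* - \tilde\beta_j)(\beta_k^* - \tilde\beta_k)\, E(X_j X_k)\Bigr]
     && \text{($\tilde\beta$ is independent of $X_1,\ldots,X_p$)} \\
  &= E\bigl[(\beta^* - \tilde\beta) \cdot \Sigma (\beta^* - \tilde\beta)\bigr]
   = E\|\beta^* - \tilde\beta\|_\Sigma^2 .
\end{align*}
```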

As mentioned before, the mean squared prediction error was the measure 
of error considered by Tibshirani in his original paper [26] and also previously 
by Breiman [3] in the paper that served as the main inspiration for the 
invention of the Lasso (see [26]). Although this gives reasonable justification 
for proving theorems about the prediction error of the Lasso, this measure of 
error is certainly not the last word in judging the effectiveness of a regression 
procedure. Indeed, as Tibshirani [26] remarks, "There are two reasons why 
the data analyst is often not satisfied with the OLS [Ordinary Least Squares] 
estimates. The first is prediction accuracy .... The second is interpretation." 
Proving that the Lasso has a small prediction error will take care of the first 
concern, but not the second. 

4. Prediction consistency of the Lasso 



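Take any $K > 0$ and define the estimator 
$\tilde\beta^K = (\tilde\beta_1^K, \ldots, \tilde\beta_p^K)$ as the minimizer of 

$$\sum_{i=1}^n (Y_i - \beta_1 X_{i,1} - \cdots - \beta_p X_{i,p})^2$$

subject to the constraint 

$$\sum_{j=1}^p |\beta_j| \le K.$$
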
If there are multiple minimizers, choose one according to some predefined 
rule. While this definition of the Lasso is not the same as the one given in 
Section 1, this is in fact the original formulation introduced by Tibshirani 
in [26]. The two definitions may be shown to be equivalent under a simple 
correspondence between $K$ and $\lambda$, although the correspondence involves 
some participation of the data. 
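
The constrained formulation can be approximated numerically in several ways. The sketch below uses projected gradient descent with a sort-based Euclidean projection onto the $\ell^1$ ball. This is one generic optimization technique chosen for illustration, not the procedure used in [26] or the LARS algorithm, and the function names and test data are hypothetical.

```python
import numpy as np

def project_l1_ball(v, K):
    """Euclidean projection of v onto {b : sum_j |b_j| <= K}, for K > 0."""
    if np.abs(v).sum() <= K:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - K)[0][-1]    # last coordinate kept active
    theta = (css[rho] - K) / (rho + 1.0)          # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(X, Y, K, n_iter=2000):
    """Approximately minimize ||Y - X b||^2 subject to sum_j |b_j| <= K."""
    # Step size 1/Lipschitz constant of the gradient of the squared error.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])                      # feasible starting point
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ b)
        b = project_l1_ball(b - step * grad, K)
    return b

# Hypothetical data: bounded covariates, p > n, sum_j |beta*_j| = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 100))
beta_star = np.zeros(100)
beta_star[:5] = 0.2
Y = X @ beta_star + 0.1 * rng.normal(size=40)
b_hat = constrained_lasso(X, Y, K=1.0)
```

With this step size every iterate stays feasible and the squared error never increases, so `b_hat` fits at least as well as the zero vector; how close it is to an exact minimizer depends on `n_iter`.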

The following theorem shows that the Lasso estimator defined above is 
'prediction consistent' if $K$ is correctly chosen and $n \gg \log p$. This is 
the main result of this paper. 

Theorem 1. Consider the setup defined in Section 2. Let $K$ be any constant 
such that 

(5) $\sum_{j=1}^p |\beta_j^*| \le K.$

Let MSPE stand for the mean squared prediction error, defined in Section 3. 
If $\tilde\beta^K$ is the Lasso estimator defined above, then 

$$\mathrm{MSPE}(\tilde\beta^K) \le 2KM\sigma \sqrt{\frac{2\log(2p)}{n}} + 8K^2M^2 \sqrt{\frac{2\log(2p^2)}{n}}.$$

Remarks. (1) Close cousins of Theorem 1 have appeared very recently in the 
literature. The two closest results are possibly Corollary 6.1 of Bühlmann 
and van de Geer [4] and Theorem 1.2 of Bartlett et al. [1]. However, these 
results do not actually give bounds on the mean squared prediction error 
defined in Section 3. Indeed, to the author's knowledge, Theorem 1 is the 
only result to date that gives a bound on the prediction error used by 
Tibshirani [26] and Breiman [3]. The results of [4] and [1] are more closely 
related to the notion of 'persistence' defined in Greenshtein and Ritov [17]. 
See also Foygel and Srebro [16] and Massart and Meynet [22] for some other 
related results. 

(2) The explicit clean bound in terms of $K$, $M$, $\sigma$, $n$ and $p$ is a new 
contribution of Theorem 1. 

(3) Suppose that a given value of $K$ is used to compute the estimate 
$\tilde\beta^K$. If the true parameter vector $\beta^*$ does not obey the 
condition (5), we cannot hope that $\tilde\beta^K$ will be a good estimate of 
$\beta^*$. There does not seem to be a way to avoid the condition (5). 

(4) Theorem 1 does not give a prescription for choosing an appropriate $K$. 
But that is a separate problem. One may use, for instance, one of the 
approaches outlined in [26] to choose a value of $K$. If $K$ is chosen based 
on the data, the error bound has to be recomputed to incorporate this 
knowledge. Theorem 1 can serve as a starting point for such a computation. 

(5) In most papers on the Lasso, it is assumed that all but a small number 
of the $\beta_j^*$'s are zero. Theorem 1 makes no such assumption. 

(6) If $K$, $M$ and $\sigma$ remain bounded as $n$ and $p$ tend to infinity, the 
only condition required for prediction consistency of the Lasso as given by 
Theorem 1 is that $n$ grows faster than $\log p$. This condition occurs in most 
modern treatments of the Lasso. The $\log p$ factor arises due to the Gaussian 
error assumption. Actually, the assumption of Gaussianity is not strictly 
required; a Gaussian tail is enough. A different assumption about the error 
would lead to a different factor. 

(7) The uniform boundedness of the covariates is not strictly necessary, 
because $M$ may be allowed to grow slowly with $n$ and $p$. Similarly, if $M$ 
remains fixed then $K$ can also grow with $n$ and $p$, as long as it grows slower 
than $(n/\log p)^{1/4}$. 

(8) Theorem 1 may be used to get error bounds for other loss functions 
under additional assumptions. For example, if we assume that the smallest 
eigenvalue of $\Sigma$ is bounded below by some number $\lambda > 0$, then the 
inequality 

$$\|\tilde\beta^K - \beta^*\|^2 \le \lambda^{-1} \|\tilde\beta^K - \beta^*\|_\Sigma^2,$$

together with Theorem 1, gives a bound on the $\ell^2$ error. Similarly, assuming 
that $\beta^*$ has only a small number of nonzero entries may allow us to derive 
stronger conclusions from Theorem 1. 
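
To get a feel for the rates discussed in remarks (6) and (7), the sketch below evaluates the bound that the argument of Section 6 yields, namely $2KM\sigma\sqrt{2\log(2p)/n} + 8K^2M^2\sqrt{2\log(2p^2)/n}$. The function name and the numerical values are hypothetical and serve only as an illustration.

```python
import math

def theorem1_bound(K, M, sigma, n, p):
    """Evaluate 2KM*sigma*sqrt(2 log(2p)/n) + 8K^2M^2*sqrt(2 log(2p^2)/n)."""
    noise_term = 2.0 * K * M * sigma * math.sqrt(2.0 * math.log(2 * p) / n)
    design_term = 8.0 * K**2 * M**2 * math.sqrt(2.0 * math.log(2 * p**2) / n)
    return noise_term + design_term

# Hypothetical values: K = M = sigma = 1 and p = 10_000 covariates.
for n in (10**3, 10**5, 10**7):
    print(n, theorem1_bound(1.0, 1.0, 1.0, n, 10_000))
```

Both terms scale like $n^{-1/2}$ for fixed $p$, so quadrupling $n$ halves the bound. The second term carries the factor $K^2$, which is why $K$ must grow slower than $(n/\log p)^{1/4}$ for the bound to vanish.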



5. Estimated prediction error 

Instead of the prediction error defined in Section 3, one may alternatively 
consider the 'estimated mean squared prediction error' of an estimator 
$\tilde\beta$, defined as 

$$\widehat{\mathrm{MSPE}}(\tilde\beta) := \frac{1}{n} \sum_{i=1}^n (\hat Y_i - \tilde Y_i)^2,$$

where 

$$\hat Y_i := \sum_{j=1}^p \beta_j^* X_{i,j}, \qquad \tilde Y_i := \sum_{j=1}^p \tilde\beta_j X_{i,j}.$$

Alternatively, this may be expressed as 

$$\widehat{\mathrm{MSPE}}(\tilde\beta) = \|\beta^* - \tilde\beta\|_{\hat\Sigma}^2,$$

where 

$$\|x\|_{\hat\Sigma}^2 = x \cdot \hat\Sigma x,$$

and $\hat\Sigma$ is the sample covariance matrix of the covariates, that is, the 
matrix whose $(j,k)$th element is 

$$\frac{1}{n} \sum_{i=1}^n X_{i,j} X_{i,k}.$$




The slight advantage of working with the estimated mean squared prediction 
error over the actual mean squared prediction error is that consistency in 
the estimated error holds if $K$ grows slower than $(n/\log p)^{1/2}$, rather than 
$(n/\log p)^{1/4}$ as demanded by the mean squared prediction error. This is 
made precise in the following theorem. 

Theorem 2. Let all notation be as in Theorem 1, and suppose that (5) 
holds. Let $\widehat{\mathrm{MSPE}}$ denote the estimated mean squared prediction 
error, as defined above. Then 

$$E\bigl(\widehat{\mathrm{MSPE}}(\tilde\beta^K)\bigr) \le 2KM\sigma \sqrt{\frac{2\log(2p)}{n}}.$$

Incidentally, the above theorem is related to the notion of persistence 
defined in [17] and thoroughly investigated in [1]. Corollary 6.1 of [4] and 
Theorem 3.1 of [22] are other closely related results. 
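
The two expressions for the estimated mean squared prediction error (the average of squared differences of fitted values, and the quadratic form in the matrix with entries $\frac{1}{n}\sum_i X_{i,j}X_{i,k}$) can be compared numerically. The following is a minimal sketch with hypothetical data and function names; `beta_tilde` is an arbitrary stand-in for an estimator.

```python
import numpy as np

def estimated_mspe(X, beta_star, beta_tilde):
    """(1/n) * sum_i (Yhat_i - Ytilde_i)^2, computed from fitted values."""
    diff = X @ beta_star - X @ beta_tilde
    return np.mean(diff ** 2)

def estimated_mspe_quadratic(X, beta_star, beta_tilde):
    """The same quantity as a quadratic form in (1/n) X^T X."""
    sigma_hat = X.T @ X / X.shape[0]
    d = beta_star - beta_tilde
    return d @ sigma_hat @ d

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(30, 8))
beta_star = rng.normal(size=8)
beta_tilde = rng.normal(size=8)      # stand-in for any estimator
a = estimated_mspe(X, beta_star, beta_tilde)
b = estimated_mspe_quadratic(X, beta_star, beta_tilde)
```

The values `a` and `b` agree up to floating-point rounding, since both equal $\frac{1}{n}\|X(\beta^* - \tilde\beta)\|^2$.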

6. Proofs of Theorems 1 and 2 

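Let $\mathbf{Y} := (Y_1, \ldots, Y_n)$, and 
$\tilde{\mathbf{Y}}^K := (\tilde Y_1^K, \ldots, \tilde Y_n^K)$, where 

$$\tilde Y_i^K := \sum_{j=1}^p \tilde\beta_j^K X_{i,j}.$$

Similarly, let 

$$\tilde Y^K := \sum_{j=1}^p \tilde\beta_j^K X_j.$$

For each $1 \le j \le p$, let $\mathbf{X}_j := (X_{1,j}, \ldots, X_{n,j})$. Finally, 
let $\hat{\mathbf{Y}} := (\hat Y_1, \ldots, \hat Y_n)$, where 

$$\hat Y_i := \sum_{j=1}^p \beta_j^* X_{i,j}.$$

Given $Z_1, \ldots, Z_n$, define the set 

$$C := \{\beta_1 \mathbf{X}_1 + \cdots + \beta_p \mathbf{X}_p : |\beta_1| + \cdots + |\beta_p| \le K\}.$$

Note that $C$ is a compact convex subset of $\mathbb{R}^n$. By definition, 
$\tilde{\mathbf{Y}}^K$ is the projection of $\mathbf{Y}$ onto the set $C$. Since $C$ 
is convex, it follows that for any $x \in C$, the vector 
$x - \tilde{\mathbf{Y}}^K$ must be at an obtuse angle to the vector 
$\mathbf{Y} - \tilde{\mathbf{Y}}^K$. That is, 

$$(x - \tilde{\mathbf{Y}}^K) \cdot (\mathbf{Y} - \tilde{\mathbf{Y}}^K) \le 0.$$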
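
The obtuse-angle property is the standard first-order characterization of projection onto a convex set. A short sketch of why it holds is given below, writing $P$ for the projection of $\mathbf{Y}$ onto $C$. The symbol $P$ is introduced only for this sketch; in the text it is the vector of Lasso fitted values.

```latex
\begin{align*}
&\text{For } x \in C \text{ and } t \in (0,1],\ P + t(x - P) \in C \text{ by convexity, so} \\
&\|\mathbf{Y} - P\|^2 \le \|\mathbf{Y} - P - t(x - P)\|^2
  = \|\mathbf{Y} - P\|^2 - 2t\,(x - P)\cdot(\mathbf{Y} - P) + t^2 \|x - P\|^2 . \\
&\text{Cancelling and dividing by } t:\quad
  2\,(x - P)\cdot(\mathbf{Y} - P) \le t\,\|x - P\|^2 \longrightarrow 0 \quad (t \downarrow 0).
\end{align*}
```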
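The condition (5) ensures that $\hat{\mathbf{Y}} \in C$. Therefore 

$$(\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K) \cdot (\mathbf{Y} - \tilde{\mathbf{Y}}^K) \le 0.$$

Writing $\varepsilon_i := Y_i - \hat Y_i$, this may be written as 

$$\|\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K\|^2 \le (\mathbf{Y} - \hat{\mathbf{Y}}) \cdot (\tilde{\mathbf{Y}}^K - \hat{\mathbf{Y}}) = \sum_{i=1}^n \varepsilon_i \Bigl(\sum_{j=1}^p (\tilde\beta_j^K - \beta_j^*) X_{i,j}\Bigr) = \sum_{j=1}^p (\tilde\beta_j^K - \beta_j^*) \Bigl(\sum_{i=1}^n \varepsilon_i X_{i,j}\Bigr).$$

By the condition (5) and the definition of $\tilde\beta^K$, we have 
$\sum_{j=1}^p |\tilde\beta_j^K - \beta_j^*| \le 2K$, so the above inequality 
implies that 

(6) $\|\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K\|^2 \le 2K \max_{1 \le j \le p} |U_j|,$

where 

$$U_j := \sum_{i=1}^n \varepsilon_i X_{i,j}.$$
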


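Let $\mathcal{F}$ be the sigma algebra generated by 
$(X_{i,j})_{1 \le i \le n,\, 1 \le j \le p}$. Let $E^{\mathcal{F}}$ denote the 
conditional expectation given $\mathcal{F}$. Conditional on $\mathcal{F}$, 

$$U_j \sim N\Bigl(0, \sigma^2 \sum_{i=1}^n X_{i,j}^2\Bigr).$$

Since $|X_{i,j}| \le M$ almost surely for all $i, j$, it follows from the standard 
results about Gaussian random variables (see Lemma 3 in the Appendix) that 

$$E^{\mathcal{F}}\bigl(\max_{1 \le j \le p} |U_j|\bigr) \le M\sigma \sqrt{2n \log(2p)}.$$

Since the right hand side is non-random, it follows that 

$$E\bigl(\max_{1 \le j \le p} |U_j|\bigr) \le M\sigma \sqrt{2n \log(2p)}.$$

Using this bound in (6), we get 

(7) $E\|\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K\|^2 \le 2KM\sigma \sqrt{2n \log(2p)}.$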

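This completes the proof of Theorem 2. For Theorem 1 we have to work a 
bit more. Note that by the independence of $Z$ and $\tilde\beta^K$, 

$$E\bigl((\hat Y - \tilde Y^K)^2 \mid Z_1, \ldots, Z_n\bigr) = \sum_{j,k=1}^p (\beta_j^* - \tilde\beta_j^K)(\beta_k^* - \tilde\beta_k^K)\, E(X_j X_k).$$

Also, we have 

$$\frac{1}{n} \|\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K\|^2 = \frac{1}{n} \sum_{i=1}^n \sum_{j,k=1}^p (\beta_j^* - \tilde\beta_j^K)(\beta_k^* - \tilde\beta_k^K) X_{i,j} X_{i,k}.$$

Therefore, if we define 

$$V_{j,k} := E(X_j X_k) - \frac{1}{n} \sum_{i=1}^n X_{i,j} X_{i,k},$$

then 

$$E\bigl((\hat Y - \tilde Y^K)^2 \mid Z_1, \ldots, Z_n\bigr) - \frac{1}{n} \|\hat{\mathbf{Y}} - \tilde{\mathbf{Y}}^K\|^2 = \sum_{j,k=1}^p (\beta_j^* - \tilde\beta_j^K)(\beta_k^* - \tilde\beta_k^K) V_{j,k}$$

(8) $\le 4K^2 \max_{1 \le j,k \le p} |V_{j,k}|.$

Since $|E(X_j X_k) - X_{i,j} X_{i,k}| \le 2M^2$ for all $i$, $j$ and $k$, it follows by 
Hoeffding's inequality (see Lemma 5 in the Appendix) that for any 
$\beta \in \mathbb{R}$, 

$$E(e^{\beta V_{j,k}}) \le e^{2\beta^2 M^4 / n}.$$

Consequently, by Lemma 4 from the Appendix, 

$$E\bigl(\max_{1 \le j,k \le p} |V_{j,k}|\bigr) \le 2M^2 \sqrt{\frac{2 \log(2p^2)}{n}}.$$

Plugging this into (8) and combining with (7) completes the proof of 
Theorem 1. 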



Appendix 

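The following inequality is a well-known result about the size of the 
maximum of Gaussian random variables. 

Lemma 3. Suppose that $\xi_i \sim N(0, \sigma_i^2)$, $i = 1, \ldots, m$. The 
$\xi_i$'s need not be independent. Let $L := \max_{1 \le i \le m} \sigma_i$. Then 

$$E\bigl(\max_{1 \le i \le m} |\xi_i|\bigr) \le L \sqrt{2 \log(2m)}.$$

Proof. For any $\beta \in \mathbb{R}$, 
$E(e^{\beta \xi_i}) = e^{\beta^2 \sigma_i^2 / 2} \le e^{\beta^2 L^2 / 2}$. Thus, for 
any $\beta > 0$, 

$$E\bigl(\max_{1 \le i \le m} |\xi_i|\bigr) = \frac{1}{\beta} E\bigl(\log e^{\max_{1 \le i \le m} \beta |\xi_i|}\bigr) \le \frac{1}{\beta} E\Bigl(\log \sum_{i=1}^m (e^{\beta \xi_i} + e^{-\beta \xi_i})\Bigr)$$

$$\le \frac{1}{\beta} \log \sum_{i=1}^m E(e^{\beta \xi_i} + e^{-\beta \xi_i}) \le \frac{\log(2m)}{\beta} + \frac{\beta L^2}{2},$$

where the second-to-last step uses Jensen's inequality. The proof is completed 
by choosing $\beta = L^{-1} \sqrt{2 \log(2m)}$. □ 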
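
A quick Monte Carlo check of the bound $E(\max_i |\xi_i|) \le L\sqrt{2\log(2m)}$ is sketched below for correlated Gaussians, to illustrate that independence is not needed. The one-factor correlation structure, the sample sizes, and the function name are arbitrary choices made for this example.

```python
import numpy as np

def gaussian_max_bound(L, m):
    """Right-hand side of the lemma: L * sqrt(2 log(2m))."""
    return L * np.sqrt(2.0 * np.log(2 * m))

rng = np.random.default_rng(2)
m, reps = 50, 20000
sigmas = rng.uniform(0.5, 1.0, size=m)       # hypothetical standard deviations
shared = rng.normal(size=(reps, 1))          # common factor creates dependence
own = rng.normal(size=(reps, m))
# 0.6^2 + 0.8^2 = 1, so each xi_i is N(0, sigma_i^2) and the xi_i are correlated.
xi = sigmas * (0.6 * shared + 0.8 * own)
empirical = np.abs(xi).max(axis=1).mean()    # Monte Carlo estimate of E max |xi_i|
bound = gaussian_max_bound(sigmas.max(), m)
```

In this example `empirical` falls below `bound`, as the lemma predicts; the gap reflects that the bound is not tight for moderate `m`.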

The result extends easily to the maximum of random variables with 
Gaussian tails. 

Lemma 4. Suppose that for $i = 1, \ldots, m$, $\xi_i$ is a random variable such 
that $E(e^{\beta \xi_i}) \le e^{\beta^2 L^2 / 2}$ for each $\beta \in \mathbb{R}$, 
where $L$ is some given constant. Then 

$$E\bigl(\max_{1 \le i \le m} |\xi_i|\bigr) \le L \sqrt{2 \log(2m)}.$$

Proof. Exactly the same as the proof of Lemma 3. □ 

The following lemma is commonly known as Hoeffding's inequality [18]. 
The version we state here is slightly different from the commonly stated 
version. For this reason, we state the lemma together with its proof. 

Lemma 5. Suppose that $\eta_1, \ldots, \eta_m$ are independent, mean zero random 
variables, and $L$ is a constant such that $|\eta_i| \le L$ almost surely for each 
$i$. Then for each $\beta \in \mathbb{R}$, 

$$E\bigl(e^{\beta \sum_{i=1}^m \eta_i}\bigr) \le e^{\beta^2 L^2 m / 2}.$$


Proof. By independence, 

$$E\bigl(e^{\beta \sum_{i=1}^m \eta_i}\bigr) = \prod_{i=1}^m E(e^{\beta \eta_i}).$$

Therefore it suffices to prove the result for $m = 1$. Note that 

$$E(e^{\beta \eta_1}) = \int_{-L}^{L} e^{\beta x} \, d\mu_1(x),$$

where $\mu_1$ is the law of $\eta_1$. By the convexity of the map 
$x \mapsto e^{\beta x}$, it follows that for each $x \in [-L, L]$, 

(9) $e^{\beta x} = e^{\beta(tL + (1-t)(-L))} \le t e^{\beta L} + (1 - t) e^{-\beta L},$

where 

$$t = t(x) = \frac{x}{2L} + \frac{1}{2}.$$

Since $E(\eta_1) = 0$, we have $\int t(x) \, d\mu_1(x) = 1/2$. Thus by (9), 
$E(e^{\beta \eta_1}) \le \cosh(\beta L)$. The inequality 
$\cosh x \le e^{x^2/2}$ completes the proof. □ 
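
The extreme case of the lemma is a sum of independent signs scaled by $L$, for which the moment generating function is exactly $\cosh(\beta L)^m$. The sketch below compares this with the bound $e^{\beta^2 L^2 m/2}$ on a few hypothetical values; the function names are illustrative.

```python
import math

def rademacher_mgf(beta, m, L=1.0):
    """Exact E exp(beta * sum_i eta_i) when eta_i = +L or -L with probability 1/2."""
    return math.cosh(beta * L) ** m

def hoeffding_mgf_bound(beta, m, L=1.0):
    """The bound exp(beta^2 L^2 m / 2) from the lemma."""
    return math.exp(beta ** 2 * L ** 2 * m / 2.0)

# Each entry records (beta, m, whether the exact value is within the bound).
checks = [(beta, m, rademacher_mgf(beta, m) <= hoeffding_mgf_bound(beta, m))
          for beta in (-2.0, -0.5, 0.1, 1.0, 3.0) for m in (1, 5, 20)]
```

Every entry of `checks` holds, which is the inequality $\cosh x \le e^{x^2/2}$ raised to the power $m$.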